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ABSTRACT 

Motion  detection  using  background  modeling  is  a  widely 
used  technique  in  object  tracking.  To  meet  the  demands  of 
real-time  multi-target  tracking  applications  in  large  and/or 
high  resolution  imagery  fast  parallel  algorithms  for  motion 
detection  are  desirable.  One  common  method  for  back¬ 
ground  modeling  is  to  use  an  adaptive  3D  median  filter 
that  is  updated  appropriately  based  on  the  video  sequence. 
We  describe  a  parallel  3D  spatiotemporal  median  filter  al¬ 
gorithm  implemented  in  CUDA  for  many  core  Graphics  Pro¬ 
cessing  Unit  (GPU)  architectures  using  the  integral  histogram 
as  a  building  block  to  support  adaptive  window  sizes.  Both 
2D  and  3D  median  filters  are  also  widely  used  in  many 
other  computer  vision  tasks  like  denoising,  segmentation, 
and  recognition.  Although  fast  sequential  median  algorithms 
exist,  improving  performance  using  parallelization  is  attrac¬ 
tive  to  reduce  the  time  needed  for  motion  detection  in  order 
to  support  more  complex  processing  in  multi-target  tracking 
systems,  large  high  resolution  aerial  video  imagery  and  3D 
volumetric  processing.  Results  show  the  frame  rate  of  the 
GPU  implementation  was  60  times  faster  than  the  CPU  ver¬ 
sion  for  a  IK  x  IK  image  reaching  49  fr/sec  and  21  times 
faster  for  512  x  512  frame  sizes  reaching  194  fr/sec.  We 
characterize  performance  of  the  parallel  3D  median  filter  for 
different  image  sizes  and  varying  number  of  histogram  bins 
and  show  selected  results  for  motion  detection. 

Keywords 

spatio-temporal  median  filtering,  3D  median,  GPU,  integral 
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1.  INTRODUCTION 

Motion  detection  is  often  a  critical  early  stage  of  video 
object  tracking.  A  common  approach  is  to  learn  the  sta¬ 
tionary  part  of  a  video  sequence  using  background  model¬ 
ing  and  detect  moving  foreground  objects  using  background 
subtraction  [1,  2,  3].  Reliable  background  subtraction  based 
motion  detection  depends  on  several  parameters  such  as 
spatio-temporal  illumination  compensation,  shadow  detec¬ 
tion,  background  update  and  the  background  model  learn¬ 
ing  rate.  In  terms  of  background  modeling,  fast  learning 
leads  to  better  adaptation  to  background  changes  but  often 
results  in  more  spurious  detections  from  small  background 
motions  or  missed  detections  when  the  object  moves  slowly 
or  becomes  temporarily  stationary  and  blends  with  the  back¬ 
ground  model.  On  the  other  hand  slow  learning  can  result 
in  ghosting  artifacts  [4,  5].  The  Median  filter  is  a  spatio- 
temporal  nonlinear  filter  used  for  image  enhancement  and 
analysis  [2,  6].  Median  filter  is  commonly  used  in  many 
computer  vision  tasks  like  segmentation,  detection,  track¬ 
ing  or  recognition  to  discriminate  moving  objects  from  the 
background  [1,  7,  8].  In  this  case  the  motion  detection  filter 
is 

\I{x,y,c)  -  IMed(x,y,c)\  <  T  (1) 

where  I(x,y,c)  is  the  current  image  pixel  intensity  value  at 
( x,y )  and  lMed(x,y,  c)  is  the  spatio-temporal  median  value 
at  pixel  ( x ,  y)  in  the  current  image,  c, 

I  Med  —  Median{I  t,  -•••,  G, ...,  I,  t  } 

Otsu  thresholding  can  be  used  to  adaptively  compute  the 
threshold  value  r  for  each  time  step.  The  most  important 
advantage  that  median  filter  has  over  other  well-known  back¬ 
ground  modeling  techniques  such  as  mixture  of  Gaussians 
[9],  Kalman  filter  [10]  or  Wallflower  [11],  is  that  it  avoids 
blending  pixel  values.  The  output  of  the  median  corresponds 
to  one  of  the  values  in  the  neighborhood  and  it  does  not 
create  unrealistic  pixel  values  out  of  the  image  content  [12]. 
However,  the  median  operation  is  computationally  expen¬ 
sive  which  limits  its  application  for  real-time,  large  scale, 
high  resolution  imagery.  In  this  paper  we  propose  a  fast 
implementation  for  background  estimation  model  using  3D 
median  filtering  for  motion  detection,  based  on  the  integral 
histogram  and  utilizing  many  core  Graphics  Processing  Unit 
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Figure  1:  Spatio-temporal  median  filter  sliding  window 
operation,  (a)  3D  box  median  filtering  [28],  (b)  2D+t 
median  filtering 

(GPU)  architectures  using  the  CUDA  programming  model. 

Fast  and  approximate  foreground  motion  estimation  is 
quite  useful  for  a  number  of  video  analysis  and  tracking  ap¬ 
plications  including  video  summarization  [13,  14,  15],  track¬ 
ing  in  wide  area  imagery  [16,  17,  18,  19],  object  detection 
and  tracking  in  biomedical  applications  [20,  21,  22,  23,  24]. 
Our  previous  work  in  parallelizing  motion  detection  include 
[25,  26,  27]. 

The  main  contributions  of  this  paper  include  implemen¬ 
tation  of  an  integral  histogram-based  multi-scale  3D  median 
filter  that  can  handle  large  temporal  windows,  design  of  an 
efficient  data  structure  for  integral  histogram,  its  layout  in 
GPU  memory,  and  optimization  of  the  kernel  configuration 
to  maximize  the  resource  utilization  of  the  GPUs  and  to 
minimize  the  data  movement.  In  the  case  of  spatio-temporal 
median  filter  T  + 1,  image  arrays  are  transferred  to  the  GPU 
using  double  buffering.  Meanwhile  the  individual  integral 
histograms  are  computed  for  each  image  array  and  the  joint 
integral  histogram  is  computed  for  T  +  1  individual  inte¬ 
gral  histograms.  Finally,  the  medians  are  computed  for  each 
pixel  using  the  joint  local  integral  histograms  of  kernel  size 
mxnin  parallel  which  results  in  high  GPU  utilization.  Re¬ 
sults  show  the  frame  rate  of  the  GPU  implementation  versus 
the  CPU  implementation. 

Section  2  presents  a  short  review  of  the  spatio-temporal 
median  filter  based  on  histogram.  Section  3  includes  the 
proposed  spatio-temporal  median  filtering  based  on  GPU  in¬ 
tegral  histogram,  followed  by  experimental  results  and  con¬ 
clusions. 

2.  SPATIO-TEMPORAL  MEDIAN  FILTER 

Several  techniques  are  typically  used  to  perform  3D  spatio- 
temporal  median  filtering.  Figure  1  shows  two  types  of  slid¬ 
ing  window  operations  for  computing  the  3D  median  filter 
[28].  Having  a  three-dimensional  array  mask  (Fig.  1(a))  with 
size  m  x  n  x  (T  +  1),  the  real  spatio-temporal  median  value 
for  each  pixel  at  (x,  y)  of  current  frame  z,  is  obtained  by 

^3D(x,y,z)=  Median  {/(x  +  Ax,  y  +  Ay,  z  +  At)} 

'i 

Aj/e[-  V-Vl 
Ate[-f  i] 

(2) 


Algorithm  1  3D  Median  Filtering  Based  on  Histogram 

Input  :  Vector  v  stores  pixel  values  located  in  the  mask  filtering 
box  with  size  m  X  n  X  (T  +  1)  centered  at  ( x ,  y) 

Output  :  med  returns  the  median  of  v 
1:  Histv  =  Hist{y ) 

2:  cumsum  =  0  / /cumulative  function  of  the  histogram 
3:  for  med=0:bins  do 
4:  if  Histv(med)  >  0  then 

5:  cumsum+  —  Histv(med) 

6:  if  cumsum  V  mXn*(T+1)  then 

7:  break; 

8:  end  if 

9:  end  if 

10:  end  for 

Return  med 


where  I(x  +  Ax,  y  +  Ay,  z  +  At)  are  pixel  values  located  in 
the  mask  kernel  with  size  m  x  n  centered  at  (x,  y)  over  the 
previous  and  following  T  frames  around  the  current  frame. 
Median (.)  is  the  median  operation  which  can  be  sort-based 
or  selection-based. 

A  fast  separable  approximation  to  the  3D  spatio-temporal 
median  filter  is  to  decouple  the  spatial  and  temporal  median 
computation  which  we  refer  to  as  M2D+t •  In  this  case,  me¬ 
dian  computations  are  separated  into  two  steps:  first,  spatial 
2D  median  filtering  using  an  m  x  n  kernel  size  is  applied  to 
each  image;  then  the  temporal  median  at  each  pixel  (x,  y)  is 
obtained  by  finding  the  median  over  T  +  1  spatial  medians 
(Fig.  1(b)).  Note  that  this  is  an  approximation  of  the  3D 
median  across  a  block  of  m  x  n  x  (T  +  1)  pixel  values  in¬ 
side  the  filtering  3D  kernel.  The  fast  separable  median  filter 
approximation  for  the  stationary  background  estimate  is, 

M2D+t{x,y,z)  =  (3) 

Median  {  Median  {I (x -\- Ax ,  y -\- Ay ,  z -\- At)}} 

a 

The  key  point  is  that  all  of  the  spatial  median  operations 
are  computationally  independent  which  can  be  executed  in 
parallel  to  improve  the  performance. 

2.1  Median  Operation 

Median  filtering  techniques  can  be  grouped  as  sorting- 
based  or  histogram-based.  The  sort-based  approaches  need 
to  generate  a  list  of  ordered  data;  then  the  median  value 
is  the  middle  value  of  the  sorted  list.  Sorting  is  computa¬ 
tionally  expensive  and  as  the  size  of  3D  median  filtering  box 
increases,  the  computational  cost  of  sorting  increases  sig¬ 
nificantly.  In  the  best  case,  the  complexity  is  0(r3)  using 
bucket  sort  [29],  where  r  is  the  filtering  kernel  radius.  But  it 
is  still  not  practical  especially  in  the  case  of  large-scale  high 
resolution  image  sequences. 

The  time  complexity  of  the  histogram  based  median  com¬ 
putation  is  much  more  efficient.  The  classic  Huang’s  fast 
2D  median  filtering  algorithm  has  O(r)  complexity  [30].  Re¬ 
cently,  Perreault  presents  a  simple,  fast  median  filter  sequen¬ 
tial  implementation  that  has  0(1)  constant  time  [29].  In 
general,  once  the  cumulative  histogram  of  the  median  mask 
filter  kernel  has  been  calculated  for  a  pixel,  the  median  value 
is  the  first  bin  histogram  which  has  a  greater  or  equal  value 
than  the  median  index.  The  median  index  is  defined  as 
half  of  the  median  mask  filter  kernel  (i.e.  box)  size.  Algo¬ 
rithm  1  shows  the  spatio-temporal  median  filter  using  the 
histogram  approach.  The  use  of  integral  histograms  make 
the  median  filter  computation  even  faster  by  providing  the 
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Figure  2:  (a)  Computation  of  the  histogram  up  to  loca¬ 
tion  (x,  y )  using  a  cross- weave  horizontal  and  vertical  scan 
on  the  image,  (b)  Computation  of  the  histogram  for  an 
arbitrary  rectangular  region  R  (origin  is  the  upper-left 
corner  with  y-axis.) 

local  histograms  of  an  arbitrary-sized  box  kernel  in  constant 
time  to  find  the  medians  or  even  other  statistical  moments 
like  mean,  variance  and  standard  deviation. 

3.  SPATIO-TEMPORAL  MEDIAN  FILTER 
BASED  ON  GPU  INTEGRAL  HISTOGRAM 

We  propose  a  fast  median  operation  based  on  integral 
histograms  and  its  parallel  implementation  utilizing  many 
core  Graphics  Processing  Unit  (GPU)  architectures  using 
the  CUDA  programming  model  for  motion  detection. 

3.1  Integral  Histogram 

The  integral  histogram  is  a  recursive  propagation  prepro¬ 
cessing  method  used  to  compute  local  histograms  over  arbi¬ 
trary  rectangular  regions  in  constant  time  [31,  32].  The  effi¬ 
ciency  of  the  integral  histogram  approach  enables  real-time 
histogram-based  exhaustive  search  in  vision  tasks  [33,  34, 

35].  The  integral  histogram  is  extensible  to  higher  dimen¬ 
sions  and  different  bin  structures.  The  integral  histogram 
at  position  ( x ,  y )  in  the  image  holds  the  histogram  for  all 
the  pixels  contained  in  the  rectangular  region  defined  by  the 
top-left  corner  of  the  image  and  the  point  ( x ,  y)  as  shown  in 
Figure  2.  The  integral  histogram  for  the  region  defined  by 
the  spatial  coordinate  ( x ,  y)  and  bin  variable  b  is  defined  as: 

x  y 

H(x,y,b)  =  EE  Q(I(r,c),b)  (4) 

r=0  c= 0 

where  Q(/(r,  c),b)  is  the  binning  function  that  evaluates  to 
1  if  /(r,  c)  G  b  for  the  bin  6,  and  evaluates  to  0  otherwise. 
Sequential  computation  of  integral  histograms  is  described 
in  Algorithm  2.  Given  the  image  integral  histogram  H,  com¬ 
putation  of  the  histogram  for  any  test  region  R  delimited  by 


Algorithm  2  Sequential  Integral  Histogram 
Input  :  Image  I  of  size  h  x  w 

Output  :  Integral  histogram  tensor  IH  of  size  h  x  w  x  b 

1:  Initialize  IH: 

IH  <-  0 

IH(I(w,  h),  w,  h)  1 
2:  for  z=l:b  do 
3:  for  x=l:w  do 

4:  for  y=l:h  do 

5:  IH(*,  y,  z)  i -  IH(a;  -  1,  y,  z)  +  IH(*,  y-l,z) 

— IH(a;  -  l,y  -  l,z)  +  Q(I(x,y),z) 

6:  end  for 

7:  end  for 

8:  end  for 

Return  IH 


Algorithm  3  GIH-STS:  GPU  Integral  Histogram  Calculation 
Using  Scan-Transpose-Scan 

Input  :  Image  I  of  size  h  X  w,  number  of  bins  b 
Output  :  Integral  histogram  tensor  IH  of  size  b  x  h  X  w 

1:  Initialize  IH 

IH  <-  0 

IH(I(w,  h),  w,  h)  4 —  1 
2:  for  all  b  x  h  blocks  in  parallel  do 
3:  / /horizontal  cumulative  sums 

4:  Prescan(IH) 

5:  end  for 

6:  //transpose  the  histogram  tensor 
IHT  3D_Transpose(IH) 

7:  for  all  b  x  w  blocks  in  parallel  do 
8:  //vertical  cumulative  sums 

9:  Prescan(IHT) 

10:  end  for 

Return  3D-Transpose(IHT) 


points  {(r  ,c  ),(r  ,  c+),  (r+ ,  ,  c  )}  reduces  to  the 

combination  of  four  integral  histograms: 

h{R,b)=  H(r+  ,c+  ,b)  —  H(r~  ,c+  ,b) 

-H(r+,c-,b)  +  H(r-,c~,b)  (5) 

Figure  2  illustrates  the  notation  and  accumulation  of  integral 
histograms  using  vertical  and  horizontal  cumulative  sums 
(prescan),  which  is  used  to  compute  regional  histograms. 

However,  the  intrinsic  parallel  characteristics  of  the  inte¬ 
gral  histogram  motivates  further  improvements  in  speed  up 
by  parallelizing  it  for  high  performance  computing  configu¬ 
rations.  Bellens,  et  al.  developed  parallel  implementations 
of  the  integral  histogram  for  Cell/B.E.  processors  [32]  and 
we  describe  a  GPU  version  in  [36]. 

3.2  Parallelized  Integral  Histogram 

The  cross- weave  scan  mode  (Fig.  2),  enables  cumulative 
sum  tasks  over  rows  (or  columns)  to  be  scheduled  and  exe¬ 
cuted  independently  allowing  for  inter-row  and  column  par¬ 
allelization.  In  fact  the  integral  histogram  computations  can 
be  divided  into  two  prescans  over  the  data.  First,  a  horizon¬ 
tal  prescan  that  computes  cumulative  sums  over  rows  of  the 
data,  followed  by  a  second  vertical  prescan  that  computes 
cumulative  sums  over  the  columns  of  the  first  scan  output. 
Taking  the  transpose  of  the  horizontally  prescanned  image 
histogram,  enables  us  to  reapply  the  same  (horizontal)  pres¬ 
can  algorithm  effectively  on  the  columns  of  the  data. 

An  image  with  dimensions  h  x  w  produces  an  integral  his¬ 
togram  tensor  of  dimensions  hxwxb ,  where  b  is  the  number 
of  bins  in  the  histogram.  We  have  mapped  the  histogram 
tensor  to  a  ID  row  major  ordered  array  for  efficient  access 
as  shown  in  Figure  3.  In  order  to  reduce  the  communication 
overhead  between  host  and  device,  data  transfer  is  limited  to 
send  an  image  to  the  GPU  and  the  final  integral  histogram 
tensor  back  to  the  CPU.  There  is  no  other  data  back  and 


r 


Figure  3:  Integral  histogram  tensor  represented  as  3D 
array  data  structure  (left),  and  equivalent  ID  array  map¬ 
ping  (right). 
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# define  BLOCK_DIM  32 

void  Transpose3D  (int*  TArray,  int*  Array,  int  w,  int  h,  int  b) 

{  dlm3  grid (b,  w/BLOCK_DIM,  h/BLOCK_DIM) ; 
dim3  threads (BL0CK_DIM  ,BL0CK_DIM  ,1); 

3DTranspose_kernel  «<  grid, threads  »>  (T_Array,  Array,  w,  h) ; 

} 

_ global _  void  3DTranspose_kernel  (int  *odata,  int  *idata,  int  width,  int  height) 

{ 

_ shared _  int  block  [BL0CK_DIM]  [BL0CK_DIM+1J.; 

/ /  read  the  matrix  tile  into  shared  memory 
unsigned  int  xlndex  =  blockldx.y  *  BL0CK_DIM  +  threadldx.x; 

unsigned  int  ylndex  =  blockldx.z  *  BL0CK_DIM  +  threadldx.y; 

if  (  (xlndex  <  width)  &&  (ylndex  <  height)  ) 

(unsigned  int  index_in  =  blockldx.x  *  width  *  height  +  ylndex  *  width  +  xlndex; 
block [threadldx.y] [threadldx.x]  =  idata [index_in] ; 

} 

_ syncthreads () ; 

//  write  the  transposed  matrix  tile  to  global  memory 
xlndex  =  blockldx.z  *  BL0CK_DIM  +  threadldx.x; 
ylndex  =  blockldx.y  *  BL0CK_DIM  +  threadldx.y; 
if  (  (xlndex  <  height)  &&  (ylndex  <  width)  ) 

[unsigned  int  index_out  =  blockldx.x  *  width  *  height  +  ylndex  *  height  +  xlndex; 
odata[index_out|  »  block [threadldx.x] [threadldx.y] ; 

} 

} 


Figure  4:  CUDA  3D  transpose  kernel  code 

forth  between  CPU  and  GPU  until  the  integral  histogram 
calculations  have  been  complected  on  the  GPU. 

Our  GPU  implementation  of  integral  histogram  [36]  which 
is  described  in  Algorithm  3  combines  cross- weave  scan  mode 
with  an  efficient  parallel  prefix  sum  operation.  Harris,  et 
al.  present  an  efficient  implementation  of  all-prefix- sums 
operation  using  the  CUDA  programming  model  [37],  that 
can  be  used  to  implement  Algorithm  3.  For  an  array  of 
size  n  the  prescan  operation  requires  only  O(n)  operations: 

2  x  (n  —  1)  additions  and  (n—  1)  swaps.  We  apply  prefix-sums 
to  the  rows  of  the  histogram  bins  (horizontal  cumulative 
sums  or  prescan),  then  transpose  the  array  and  reapply  the 
prescan  to  the  rows  to  obtain  the  integral  histograms.  In 
order  to  allow  a  single  transpose  operation,  we  extended 
the  optimized  2-D  transpose  kernel  described  in  [38]  to  a  3D 
transpose  kernel  by  using  the  bin  offset  in  the  indexing  (Fig. 
4).  The  optimized  transpose  kernel  uses  zero  bank  conflict 
shared  memory  and  guaranties  that  global  reads  and  writes 
are  coalesced. 

Our  GPU  integral  histogram  implementation  benefits  from 
minimum  data  transfer  between  CPU  and  GPU,  double  buffer¬ 
ing  and  high  GPU  utilization.  The  number  of  threads  is 
automatically  determined  based  on  the  image  size  to  ensure 
maximum  occupancy  per  kernel  (for  more  info  refer  to  [36]). 

3.3  3D  Median  Filter  Based  on  Integral  His¬ 
togram 

The  integral  histogram  of  an  image  provides  the  local  in¬ 
tegral  histogram  of  every  arbitrary  target  region  in  constant 
time  (equation  5).  Taking  advantages  of  this  property  en- 


Figure  5:  Updating  the  joint  integral  histogram  for  me¬ 
dian  filtering. 


Algorithm  4  3D  Median  Filter  Based  on  Integral  Histogram 

Input  :  Image  sequences  I  [k]  of  size  h  X  w, 
number  of  bins  b , 
size  of  image  history  T  +  1 
Output  :  Medians  M[k  —  T]]  of  size  h  x  w 

1:  Initialize  JointIH 

2:  IH_tail=JIH=IntegraLHist(Quantized(image(l))); 

//Compute  the  first  joint  integral  histogram  for  the  first 
T  +  1  frames 

3:  for  Fr  —  2  :  T  +  1  do 

4:  IH[Fr\  =  Integral_Hist{Quantized{image{Fr)))\ 

5:  JIH  =  JIH  +  IH[Fr]; 

6:  end  for 

7:  // Calculate  the  Median  of  current  frame 
8:  for  Fr  =  T  +  2  :  k  do 
9:  //Update  the  IH_head 

1 0 :  IH_head=Int  egr  aLHist  (Quantized  (image  (Fr) ) ) ; 

11:  //Update  the  Joint  Integral  Histogram 

12:  JIH=  JIH  +  IH_head  -  IH_tail; 

13:  / /Update  the  IH_tail 

14:  IH_tail=IntegraLHist(Quantized(image(Fr-T))); 

15:  //Compute  Median  [Fr  —  ^] 

16:  Median  [Fr  —  ^]  =  ComputeLocalMedian(JIH) 

17:  end  for 


ables  us  to  compute  the  medians  in  constant  time  for  all 
target  pixels  using  its  local  integral  histogram.  Algorithm  4 
shows  spatio-temporal  median  computations  using  the  inte¬ 
gral  histogram  (  assuming  image  sequences  of  size  k  which 
are  transferred  to  the  GPU  using  double  buffering  and  a 
spatio-temporal  median  filter  of  size  mXnx(T+l)).  In 
the  initialization  phase,  the  individual  integral  histograms 
are  computed  for  each  image  array.  Meanwhile  the  joint  in¬ 
tegral  histogram  is  computed  for  the  first  T  +  1  individual 
integral  histograms  (assume  that  T  is  an  even  number).  Af¬ 
ter  creating  the  joint  integral  histogram,  the  median  calcu¬ 
lation  phase  started  for  frame  +2.  In  each  iteration,  first, 
the  joint  integral  histogram  is  updated  by  adding  the  inte¬ 
gral  histogram  of  head  image  and  subtracting  the  integral 
histogram  of  tail  image  (Fig.  5).  Then  medians  are  com¬ 
puted  for  each  pixel  using  the  joint  local  integral  histograms 
of  kernel  size  m  X  n  where  local  kernel  integral  histograms 
can  be  obtained  in  0(1). 

4.  EXPERIMENTS 

We  implemented  and  evaluated  background  subtraction 
using  spatio-temporal  median  filter  based  on  CPU  and  par¬ 
allel  GPU  integral  histogram  implementations.  Our  experi¬ 
ments  were  conducted  on  a  2.0  GHz  Quad  Core  Intel  CPU 
(Core  i7-2630QM)  and  two  GPU  cards:  an  NVIDIA  Tesla 
C2070  and  an  NVIDIA  GeForce  GTX  460.  The  former  is 
equipped  with  fourteen  32-core  SMs  and  has  about  5GB 
of  global  memory,  48KB  shared  memory  with  compute  ca¬ 
pability  2.0.  The  latter  consists  of  seven  48-core  SM  and 
is  equipped  with  1GB  global  memory,  48KB  shared  mem¬ 
ory  with  compute  capability  2.1.  The  parallel  GPU  integral 
histogram  implementation  is  based  on  three  kernel  invoca¬ 
tions:  a  single  horizontal  scan,  a  3D  transpose,  and  a  ver¬ 
tical  scan.  The  integral  histogram  computations  start  after 
transferring  the  image  to  the  GPU,  complete  the  calcula¬ 
tion  of  the  integral  histogram  on  the  GPU  then  transfer  the 
final  integral  histogram  tensor  back  to  the  CPU  (this  has 
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Figure  6:  UP:  Frame  rate  of  our  3D  median  filter  based  on  GPU  integral  histogram  and  CPU-only  integral  histogram 
implementations:  (UL)  frame  rate  for  different  image  sizes  and  16  bins,  (UR)  frame  rate  for  512x512  image  size  and 
different  number  of  bins;  Bottom:  Speedup  of  proposed  3D  median  filter  based  on  GPU  integral  histogram  over  CPU 
using  two  NVIDIA  graphic  cards,  (BL)  Speedup  for  different  image  sizes  and  16  bins  with  respect  to  CPU-only,  (BR) 
Speedup  with  varying  number  of  bins  for  512x512  image  size. 


Figure  7:  Frame  rate  performance  comparison  of  the 
GPU  design  versus  CPU  implementation  using  different 
degrees  of  multi-threading,  CPU1,  CPU8,  CPU16  and 
Cell/B.E.  performance  results  presented  for  wavefront 
scan  mode  using  8  SPEs  in[32]. 


the  least  communication  overhead  and  the  most  GPU  uti¬ 
lization).  Figure  6  summarizes  the  frame  rate  performance 
and  speed-up  of  the  GPU  implementations  compared  to  the 
sequential  CPU-only  implementation.  The  frame  rate  is  de¬ 
fined  as  the  maximum  number  of  images  processed  per  sec¬ 
ond.  Since  we  use  double  buffering,  the  frame  rate  equals 
(, kernel-execution-time )_1  for  compute  bound  cases,  or 
(i data_transfer_time)- 1  for  data-transfer  bound  cases.  Con¬ 
sidering  double  buffering  timing,  our  3D  median  filter  based 
on  GPU  integral  histogram  achieves  194  fr/sec  to  compute 
16-bin  integral  histogram  for  a  512  x  512  image  (21  times 
faster  than  CPU-only  implementation)  and  60  times  faster 
than  the  CPU  version  for  IK  x  IK  image  reaching  49  fr/sec 
using  the  NVIDIA  Tesla  C2070  GPU.  Figure  7  compares  the 
frame  rate  performance  of  our  GPU  integral  histogram  im¬ 
plementation  using  different  GPU  hardware,  with  a  sequen¬ 
tial  CPU  implementation  using  different  degrees  of  multi¬ 
threading:  CPU1  (1  thread),  CPU8  (8  threads),  CPU16  (16 
threads).  We  have  compared  our  performances  to  the  best 
performance  of  the  IBM  Cell/B.E  integral  histogram  paral¬ 
lel  implementation  computation  using  wavefront  scan  mode 
[32].  In  most  cases,  our  performance  is  data-transfer-bound: 
the  achieved  speedup  is  limited  by  transferring  the  data  be¬ 
tween  CPU  and  GPU  over  the  PCI-express  interconnect, 
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Figure  8:  Background  subtraction  for  a  sample  DARPA  VIVID  stabilized  aerial  sequence  using  3D  median  filter 
based  on  GPU  integral  histogram  using  different  number  of  bins,  from  left  column  to  the  right  column:  original  image, 
foreground  using  16  bins  integral  histogram,  foreground  using  128  bins  integral  histogram,  foreground  using  256  bins 
integral  histogram. 


rather  than  from  the  parallel  execution  on  the  GPU.  Figure 
8  shows  sample  background  subtraction  results  for  a  DARPA 
VIVID  stabilized  aerial  sequence  using  the  3D  median  filter 
based  on  GPU  integral  histogram  using  different  number 
of  bins  (16,  128  and  256).  Figure  9  shows  sample  back¬ 
ground  subtraction  results  for  two  image  modalities  visible 
and  IR  from  the  PETS2013  Benchmark  background  dataset 
(first  three  rows)  [39],  and  indoor  hallway  motion  from  the 
OTCBVS  Benchmark  Dataset  [40]  (last  three  rows),  using 
the  3D  median  filter  based  on  GPU  integral  histogram  with 
varying  number  of  bins  (16,  128  and  256).  These  results  sug¬ 
gest  that  in  general  for  natural  images  with  large  dynamic 
ranges  using  a  lower  number  of  bins  in  computing  the  in¬ 
tegral  histogram  does  not  adversely  degrade  the  detection 
results. 

5.  CONCLUSIONS 

Although  the  median  filter  is  widely  used  in  computer  vi¬ 
sion  tasks,  it  is  computationally  expensive  which  limits  its 
application  for  large  scale,  high  resolution  imagery  and  real¬ 
time  requirements.  We  showed  that  an  efficient  GPU-based 


parallel  implementation  of  the  spatio-temporal  median  filter 
can  be  achieved  using  integral  histograms.  The  median  filter 
based  on  GPU  integral  histograms  not  only  takes  advantage 
of  fast  GPU  processing  but  also  enables  us  to  compute  the 
median  across  multiple  scales  for  all  target  pixels  in  constant 
time.  Results  show  the  frame  rate  of  the  GPU  implementa¬ 
tion  was  60  times  faster  than  the  CPU  version  for  a  IK  x  IK 
image  reaching  49  fr/sec  and  21  times  faster  for  512  x  512 
frame  sizes  reaching  194  fr/sec.  This  opens  up  the  possibil¬ 
ity  for  speeding  up  a  variety  of  computer  vision  detection 
tasks. 
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Figure  9:  Background  subtraction  for  PETS2013  Benchmark  background  dataset  (first  row  frame  73,  second  row  frame 
161,  third  row  frame  195  )  [39],  an  indoor  hallway  motion  (seq.  irinOl)  from  OTCBVS  Benchmark  IR  Database(row 
four  frame  5150,  row  five  frame  8290  and  row  six  frame  8329)  [40],  and  outdoor  motion  and  tracking  scenarios  (seq. 
irwOl)  from  OTCBVS  Benchmark  IR  Database  (row  seven  frame  332,  row  eight  frame  399,  row  nine  frame  480)  [40] 
using  3D  median  filter  based  on  GPU  integral  histogram  using  different  number  of  bins,  from  left  column  to  the  right 
column:  original  image,  background  model  (256  bins),  foreground  using  16  bins  integral  histogram,  foreground  using 
128  bins  integral  histogram,  foreground  using  256  bins  integral  histogram. 
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